arXiv:cs/0112007v2 [cs.DB] 30 Nov 2002 


A tight upper bound 
on the number of candidate patterns* 

Floris Geerts, Bart Goethals, Jan Van den Bussche 
University of Limburg, Belgium 


Abstract 

In the context of mining for frequent patterns using the standard 
levelwise algorithm, the following question arises: given the current 
level and the current set of frequent patterns, what is the maximal 
number of candidate patterns that can be generated on the next level? 
We answer this question by providing a tight upper bound, derived 
from a combinatorial result from the sixties by Kruskal and Katona. 
Our result is useful to reduce the number of database scans. 


*A preliminary report on this work was presented at the 2001 IEEE International 
Conference on Data Mining 



1 Introduction 


The frequent pattern mining problem is by now well known. We are given
a set of items $\mathcal{I}$ and a database $\mathcal{D}$ of subsets of
$\mathcal{I}$ called transactions. A pattern is some set of items; its
support in $\mathcal{D}$ is defined as the number of transactions in
$\mathcal{D}$ that contain the pattern; and a pattern is called frequent
in $\mathcal{D}$ if its support exceeds a given minimal support threshold.
The goal is to find all frequent patterns in $\mathcal{D}$.

The search space of this problem, all subsets of $\mathcal{I}$, is clearly
huge. Instead of generating and counting the supports of all these patterns
at once, several solutions have been proposed to perform a more directed
search through all patterns. However, this directed search requires several
scans through the database, which is costly in itself, because these
databases tend to be very large and hence do not fit into main memory.

The standard Apriori algorithm for solving this problem is based on
the monotonicity property of support: all subsets of a frequent pattern
must be frequent. A pattern is thus considered potentially frequent, also
called a candidate pattern, if its support is yet unknown, but all of its
subsets are already known to be frequent. In every step of the algorithm,
all candidate patterns are generated and their supports are then counted by
performing a complete scan of the transaction database. This is repeated
until no new candidate patterns can be generated. Hence, the number of scans
through the database equals the maximal size of a candidate pattern. Several
improvements on the Apriori algorithm try to reduce the number of scans
through the database by estimating the number of candidate patterns that can
still be generated.
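The levelwise loop described above can be sketched as follows. This is only an illustrative brute-force sketch, not the optimized implementation discussed later in the paper; the function name `apriori` and the toy database are assumptions made for this example. Candidates are enumerated naively from the items that still occur in frequent patterns, and a pattern counts as frequent when its support exceeds the threshold.

```python
from itertools import combinations

def apriori(transactions, min_support):
    """Levelwise search: one full database scan per candidate size."""
    transactions = [frozenset(t) for t in transactions]
    items = sorted(set().union(*transactions))
    candidates = [frozenset([x]) for x in items]
    frequent = {}          # pattern -> support
    scans = 0
    k = 1
    while candidates:
        scans += 1
        # One scan: count the support of every candidate of size k.
        counts = {c: sum(1 for t in transactions if c <= t) for c in candidates}
        level = {c for c, n in counts.items() if n > min_support}
        frequent.update({c: counts[c] for c in level})
        # A (k+1)-set is a candidate iff all of its k-subsets are frequent.
        pool = sorted(set().union(*level)) if level else []
        candidates = [
            frozenset(c) for c in combinations(pool, k + 1)
            if all(frozenset(s) in level for s in combinations(c, k))
        ]
        k += 1
    return frequent, scans

# Hypothetical toy database of six transactions.
db = [{1, 2, 3}, {1, 2, 3}, {1, 2}, {2, 3}, {1, 3}, {4}]
frequent, scans = apriori(db, min_support=1)
print(scans, len(frequent))
```

On this toy input the largest candidate has size 3, and the loop performs exactly three scans, in line with the observation that the number of scans equals the maximal size of a candidate pattern.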

At the heart of all these techniques lies the following purely combinatorial
problem, that must be solved first before we can seriously start applying
them: given the current set of frequent patterns at a certain pass of the
algorithm, what is the maximal number of candidate patterns that can be
generated in the passes yet to come?

Our contribution is to solve this problem by providing a hard and tight 
combinatorial upper bound. By computing our upper bound after every 
pass of the algorithm, we have at all times a watertight guarantee on the 
size of what is still to come, on which we can then base various optimization 
decisions, depending on the specific algorithm that is used. 

In the next section, we discuss existing techniques to reduce the number
of database scans, and point out the dangers of using existing heuristics
for this purpose. Using our upper bound, these techniques can be made
watertight. In Section 3 we derive our upper bound, using a combinatorial
result from the sixties by Kruskal and Katona. In Section 4 we show how
to get even more out of this upper bound by applying it recursively. In
Section 5 we generalize the given upper bounds such that they can be applied
by a wider range of algorithms. In Section 6 we discuss several issues
concerning the implementation of the given upper bounds on top of
Apriori-like algorithms. In Section 7 we give experimental results, showing
the effectiveness of our result in estimating, far ahead, how much will
still be generated in the future. Finally, we conclude the paper in
Section 8.


2 Related Work 


Nearly all frequent pattern mining algorithms developed after the proposal
of the Apriori algorithm rely on its levelwise candidate generation and
pruning strategy. Most of them differ in how they generate and count
candidate patterns.

One of the first optimizations was the DHP algorithm proposed by Park
et al. This algorithm uses a hashing scheme to collect upper bounds
on the frequencies of the candidate patterns for the following pass.
Patterns that are already known to turn out infrequent can then be
eliminated from further consideration. The effectiveness of this technique
only showed for the first few passes. Since our upper bound can be used to
eliminate passes at the end, both techniques can be combined in the same
algorithm.

Other strategies, discussed next, try to reduce the number of passes.
However, such a reduction of passes often causes an increase in the number
of candidate patterns that need to be explored during a single pass. This
tradeoff between the reduction of passes and the number of candidate
patterns is important since the time needed to process a transaction is
dependent on the number of candidates that are covered in that transaction,
which might blow up exponentially. Our upper bound can be used to predict
whether or not this blowup will occur.


The Partition algorithm, proposed by Savasere et al., reduces the
number of database passes to two. Towards this end, the database is
partitioned into parts small enough to be handled in main memory. The
partitions are then considered one at a time and all frequent patterns for
that partition are generated using an Apriori-like algorithm. At the end of
the first pass, all these patterns are merged to generate a set of all
potential frequent patterns, which can then be counted over the complete
database. Although this method performs only two database passes, its
performance is heavily dependent on the distribution of the data, and it
may generate far too many candidates.

The sampling algorithm proposed by Toivonen performs at most two
scans through the database by picking a random sample from the database,
then finding all frequent patterns that probably hold in the whole database,
and then verifying the results with the rest of the database. In the cases
where the sampling method does not produce all frequent patterns, the
missing patterns can be found by generating all remaining potentially
frequent patterns and verifying their frequencies during a second pass
through the database. The probability of such a failure can be kept small by
decreasing the minimal support threshold. However, for a reasonably small
probability of failure, the threshold must be drastically decreased, which
can again cause a combinatorial explosion of the number of candidate
patterns.

The DIC algorithm, proposed by Brin et al., tries to reduce the
number of passes over the database by dividing the database into intervals
of a specific size. First, all candidate patterns of size 1 are generated.
The frequencies of the candidate sets are then counted over the first
interval of the database. Based on these frequencies, candidate patterns of
size 2 are generated and are counted over the next interval together with
the patterns of size 1. In general, after every interval $k$, candidate
patterns of size $k + 1$ are generated and counted. The algorithm stops if
no more candidates can be generated. Again, this technique can be combined
with our technique in the same algorithm.

Another type of algorithm generates frequent patterns using a depth-first
search. Generating patterns in a depth-first manner implies that the
monotonicity property cannot be exploited anymore. Hence, many more
patterns will be generated and need to be counted, compared to the
breadth-first algorithms. The FPgrowth algorithm from Han et al. solves this
problem by loading a compressed form of the database in main memory using
the proposed FPtree. This memory-resident FPtree benefits from a very fast
counting mechanism of all generated patterns.¹ Obviously, it is not always
possible to load the compressed form of the database into main memory.

¹Note that the patterns in the FPtree are represented in the so-called
header tables.

Other strategies try to push certain constraints into the candidate pattern
generation as deeply as possible to reduce the number of candidate patterns
that must be generated. Still others try to find only the set of maximal
frequent patterns, i.e., those frequent patterns that have no superset which
is also frequent. Of course, these techniques do not give us all frequencies
of all frequent patterns as required by the general pattern mining problem
we consider in this paper.

The first heuristic specifically proposed to estimate the number of
candidate patterns that can still be generated was used in the AprioriHybrid
algorithm. This algorithm uses Apriori in the initial iterations and
switches to AprioriTid if it expects it to run faster. The AprioriTid
algorithm does not use the database at all for counting the support of
candidate patterns. Rather, an encoding of the candidate patterns used in
the previous iteration is employed for this purpose. The AprioriHybrid
algorithm switches to AprioriTid when it expects this encoding of the
candidate patterns to be small enough to fit in main memory. The size of the
encoding grows with the number of candidate patterns. Therefore, it
calculates the size the encoding would have in the current iteration. If
this size is small enough and there were fewer candidate patterns in the
current iteration than in the previous iteration, the heuristic decides to
switch to AprioriTid.

This heuristic (like all heuristics) is not waterproof, however. Take, for
example, two disjoint datasets. The first dataset consists of all subsets of
a frequent pattern of size 20. The second dataset consists of all subsets of
1 000 disjoint frequent patterns of size 5. If we merge these two datasets,
we get $\binom{20}{3} + 1000\binom{5}{3} = 11\,140$ patterns of size 3 and
$\binom{20}{4} + 1000\binom{5}{4} = 9\,845$ patterns of size 4. If we have
enough memory to store the encoding for all these patterns, then the
heuristic decides to switch to AprioriTid. This decision is premature,
however, because the number of new patterns in each pass will start growing
exponentially afterwards.
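The arithmetic of this counterexample is easy to check. The short sketch below (the helper name `patterns_of_size` is only illustrative) counts the patterns of each size in the merged dataset: the 20-item pattern contributes $\binom{20}{s}$ patterns of size $s$ and each of the 1 000 disjoint 5-item patterns contributes $\binom{5}{s}$.

```python
from math import comb

def patterns_of_size(s):
    # One frequent 20-set plus 1000 pairwise disjoint frequent 5-sets.
    return comb(20, s) + 1000 * comb(5, s)

for s in range(3, 11):
    print(s, patterns_of_size(s))
```

The count drops from size 3 to size 4, which is what triggers the heuristic, and then rises again from size 5 onwards as the subsets of the 20-item pattern take over.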

Also, current state-of-the-art algorithms for frequent itemset mining, such
as Opportunistic Project [23] and DCI, use several techniques within the
same algorithm and switch between these techniques using several simple,
but not waterproof, heuristics.

Another improvement of the Apriori algorithm, which is part of the
folklore, tries to combine as many iterations as possible in the end, when
only few candidate patterns can still be generated. The potential of such a
combination technique was realized early on, but the modalities under which
it can be applied were never further examined. Our work does exactly that.


3 The basic upper bounds 

In all that follows, $L$ is some family of patterns of size $k$.

Definition 1. A candidate pattern for $L$ is a pattern (of size larger than
$k$) of which all $k$-subsets are in $L$. For a given $p > 0$, we denote the
set of all size-$(k+p)$ candidate patterns for $L$ by $C_{k+p}(L)$.

For any $p \geq 1$, we will provide an upper bound on $|C_{k+p}(L)|$ in
terms of $|L|$. The following lemma is central to our approach. (A simple
proof was given by Katona.)

Lemma 1. Given $n$ and $k$, there exists a unique representation

\[
n = \binom{m_k}{k} + \binom{m_{k-1}}{k-1} + \cdots + \binom{m_r}{r},
\]

with $r \geq 1$, $m_k > m_{k-1} > \cdots > m_r$, and $m_i \geq i$ for
$i = r, r+1, \ldots, k$.

This representation is called the $k$-canonical representation of $n$ and
can be computed as follows: The integer $m_k$ satisfies
$\binom{m_k}{k} \leq n < \binom{m_k+1}{k}$, the integer $m_{k-1}$ satisfies
$\binom{m_{k-1}}{k-1} \leq n - \binom{m_k}{k} < \binom{m_{k-1}+1}{k-1}$,
and so on, until
$n - \binom{m_k}{k} - \binom{m_{k-1}}{k-1} - \cdots - \binom{m_r}{r}$
is zero.
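The greedy procedure just described translates directly into code. The following is a minimal sketch; the function name `canonical_representation` and the list-of-pairs output format are choices made for this illustration, not notation from the paper.

```python
from math import comb

def canonical_representation(n, k):
    """Return [(m_k, k), (m_{k-1}, k-1), ..., (m_r, r)] such that
    n == sum(comb(m, i) for m, i in result), chosen greedily as in Lemma 1."""
    terms = []
    i = k
    while n > 0 and i >= 1:
        m = i                          # comb(i, i) = 1 <= n holds here
        while comb(m + 1, i) <= n:     # find the largest m with comb(m, i) <= n
            m += 1
        terms.append((m, i))
        n -= comb(m, i)
        i -= 1
    return terms

print(canonical_representation(13, 3))   # 13 = C(5,3) + C(3,2)
print(canonical_representation(21, 3))   # 21 = C(6,3) + C(2,2)
```

Each step picks the largest binomial coefficient with lower index $i$ that still fits into the remainder, so the upper indices are strictly decreasing and the loop stops as soon as the remainder is zero.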

We now establish: 


Theorem 2. If

\[
|L| = \binom{m_k}{k} + \binom{m_{k-1}}{k-1} + \cdots + \binom{m_r}{r}
\]

in $k$-canonical representation, then

\[
|C_{k+p}(L)| \leq \binom{m_k}{k+p} + \binom{m_{k-1}}{k-1+p} + \cdots
  + \binom{m_{s+1}}{s+p+1},
\]

where $s$ is the smallest integer such that $m_s < s + p$. If no such
integer exists, we set $s = r - 1$.


Proof. Suppose, for the sake of contradiction, that

\[
|C_{k+p}(L)| \geq \binom{m_k}{k+p} + \binom{m_{k-1}}{k-1+p} + \cdots
  + \binom{m_{s+1}}{s+p+1} + \binom{s+p}{s+p}.
\]

Note that this is in $(k+p)$-canonical representation. A theorem by Kruskal
and Katona says that

\[
|L| \geq \binom{m_k}{k} + \binom{m_{k-1}}{k-1} + \cdots
  + \binom{m_{s+1}}{s+1} + \binom{s+p}{s}.
\]

But this is impossible, because

\begin{align*}
|L| &= \binom{m_k}{k} + \binom{m_{k-1}}{k-1} + \cdots + \binom{m_{s+1}}{s+1}
       + \binom{m_s}{s} + \cdots + \binom{m_r}{r} \\
    &\leq \binom{m_k}{k} + \binom{m_{k-1}}{k-1} + \cdots + \binom{m_{s+1}}{s+1}
       + \sum_{1 \leq i \leq s} \binom{i+p-1}{i} \\
    &< \binom{m_k}{k} + \binom{m_{k-1}}{k-1} + \cdots + \binom{m_{s+1}}{s+1}
       + \sum_{0 \leq i \leq s} \binom{i+p-1}{i} \\
    &= \binom{m_k}{k} + \binom{m_{k-1}}{k-1} + \cdots + \binom{m_{s+1}}{s+1}
       + \binom{s+p}{s}.
\end{align*}

The first inequality follows from the observation that $m_s \leq s + p - 1$
implies $m_i \leq i + p - 1$ for all $i = s, s-1, \ldots, r$. The last
equality follows from a well-known binomial identity. □
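As a hedged illustration, the bound of Theorem 2 can be evaluated as follows. The names `canonical_representation` and `kk_bound` are hypothetical. The sketch does not locate $s$ explicitly: a term $\binom{m_i}{i+p}$ with $m_i < i + p$ is zero, so summing the shifted binomials over all terms of the canonical representation gives the same value.

```python
from math import comb

def canonical_representation(n, k):
    """Greedy k-canonical representation of n as a list of (m_i, i) pairs."""
    terms = []
    i = k
    while n > 0 and i >= 1:
        m = i
        while comb(m + 1, i) <= n:
            m += 1
        terms.append((m, i))
        n -= comb(m, i)
        i -= 1
    return terms

def kk_bound(n, k, p):
    """Upper bound of Theorem 2 on the number of candidates of size k + p,
    given n patterns of size k. math.comb returns 0 when the lower index
    exceeds the upper one, so vanishing terms need no special handling."""
    return sum(comb(m, i + p) for m, i in canonical_representation(n, k))

print(kk_bound(13, 3, 1), kk_bound(13, 3, 2), kk_bound(13, 3, 3))
```

For 13 patterns of size 3, the representation is $\binom{5}{3} + \binom{3}{2}$, so the bounds for sizes 4, 5 and 6 are $\binom{5}{4} + \binom{3}{3}$, $\binom{5}{5}$ and 0.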


Notation. We will refer to the upper bound provided by the above theorem
as $KK_k^{k+p}(|L|)$ (for Kruskal-Katona). The subscript $k$, the level at
which we are predicting, is important, as the only parameter is the
cardinality $|L|$ of $L$, not $L$ itself. The superscript $k + p$ denotes
the level we are predicting.


Proposition 3 (Tightness). The upper bound provided by Theorem 2 is
tight: for any given $n$ and $k$ there always exists an $L$ with $|L| = n$
such that for any given $p$, $|C_{k+p}(L)| = KK_k^{k+p}(|L|)$.

Proof. Let us write a finite set of natural numbers as a string of natural
numbers by writing its members in decreasing order. We can then compare
two such sets by comparing their strings in lexicographic order. The
resulting order on the sets is known as the colexicographic (or colex)
order. An intuitive proof of the Kruskal-Katona theorem, based on this colex
order, was given by Bollobás [10]. Let

\[
n = \binom{m_k}{k} + \binom{m_{k-1}}{k-1} + \cdots + \binom{m_r}{r}
\]

be the $k$-canonical representation of $n$. Then, Bollobás has shown that
all $(k-p)$-subsets of the first $n$ $k$-sets of natural numbers in colex
order are exactly the first

\[
\binom{m_k}{k-p} + \binom{m_{k-1}}{k-1-p} + \cdots + \binom{m_s}{s-p}
\]

$(k-p)$-sets of natural numbers in colex order, with $s$ the smallest
integer such that $s > p$. Using the same reasoning as above, we can
conclude that all $(k+p)$-supersets of the first $n$ $k$-sets of natural
numbers in colex order are exactly the first $KK_k^{k+p}(n)$ $(k+p)$-sets of
natural numbers in colex order. □

Analogous tightness properties hold for all upper bounds we will present
in this paper, but we will no longer explicitly state this.

Example 1. Let $L$ be the set of 13 patterns of size 3:

{{3,2,1}, {4,2,1}, {4,3,1}, {4,3,2},
 {5,2,1}, {5,3,1}, {5,3,2}, {5,4,1}, {5,4,2}, {5,4,3},
 {6,2,1}, {6,3,1}, {6,3,2}}.

The 3-canonical representation of 13 is $\binom{5}{3} + \binom{3}{2}$ and
hence the maximum number of candidate patterns of size 4 is
$KK_3^4(13) = \binom{5}{4} + \binom{3}{3} = 6$ and the maximum number of
candidate patterns of size 5 is $KK_3^5(13) = \binom{5}{5} = 1$. This is
tight indeed, because

$C_4(L)$ = {{4,3,2,1}, {5,3,2,1}, {5,4,2,1},
            {5,4,3,1}, {5,4,3,2}, {6,3,2,1}}

and

$C_5(L)$ = {{5,4,3,2,1}}.
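The candidate sets of this example can be confirmed by brute force directly from Definition 1. The sketch below uses a hypothetical helper `candidates` that enumerates all sets of a given size over the items occurring in $L$ and keeps those whose $k$-subsets all belong to $L$; this is meant only as a check on small inputs, not as an efficient generation method.

```python
from itertools import combinations

def candidates(L, size):
    """All patterns of the given size whose k-subsets all belong to L."""
    L = {frozenset(s) for s in L}
    k = len(next(iter(L)))
    items = sorted(set().union(*L))
    return [set(c) for c in combinations(items, size)
            if all(frozenset(s) in L for s in combinations(c, k))]

L = [{3, 2, 1}, {4, 2, 1}, {4, 3, 1}, {4, 3, 2},
     {5, 2, 1}, {5, 3, 1}, {5, 3, 2}, {5, 4, 1}, {5, 4, 2}, {5, 4, 3},
     {6, 2, 1}, {6, 3, 1}, {6, 3, 2}]

print(candidates(L, 4))
print(candidates(L, 5))
```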

Estimating the number of levels. The $k$-canonical representation of $|L|$
also yields an upper bound on the maximal size of a candidate pattern,
denoted by maxsize($L$). Recall that this size equals the number of
iterations the standard Apriori algorithm will perform. Indeed, since
$|L| < \binom{m_k+1}{k}$, there cannot be a candidate pattern of size
$m_k + 1$ or higher, so:

Proposition 4. If $\binom{m_k}{k}$ is the first term in the $k$-canonical
representation of $|L|$, then maxsize($L$) $\leq m_k$.

We denote this number by $\mu_k(|L|)$. From the form of $KK_k^{k+p}$ as
given by Theorem 2, it is immediate that $\mu$ also tells us the last level
before which $KK$ becomes zero. Formally:

Proposition 5.

\[
\mu_k(|L|) = k + \min\{p \mid KK_k^{k+p}(|L|) = 0\} - 1.
\]

Estimating all levels. As a result of the above, we can also bound, at any
given level $k$, the total number of candidate patterns that can be
generated, as follows:

Proposition 6. The total number of candidate patterns that can be generated
from a set $L$ of $k$-patterns is at most

\[
KK_k^{\mathrm{total}}(|L|) := \sum_{p \geq 1} KK_k^{k+p}(|L|).
\]
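Propositions 4 to 6 can be illustrated with a few lines of code. This is a sketch under the assumption that $n \geq 1$; the names `mu`, `mu_from_kk` and `kk_total` are hypothetical and stand for $\mu_k(n)$, the right-hand side of Proposition 5, and $KK_k^{\mathrm{total}}(n)$ respectively.

```python
from math import comb

def canonical_representation(n, k):
    """Greedy k-canonical representation of n as a list of (m_i, i) pairs."""
    terms = []
    i = k
    while n > 0 and i >= 1:
        m = i
        while comb(m + 1, i) <= n:
            m += 1
        terms.append((m, i))
        n -= comb(m, i)
        i -= 1
    return terms

def kk_bound(n, k, p):
    """KK bound for size k + p; terms with m_i < i + p evaluate to zero."""
    return sum(comb(m, i + p) for m, i in canonical_representation(n, k))

def mu(n, k):
    """Proposition 4: m_k, the upper index of the first canonical term."""
    return canonical_representation(n, k)[0][0]

def mu_from_kk(n, k):
    """Proposition 5: k + min{p | KK bound for size k + p is 0} - 1."""
    p = 1
    while kk_bound(n, k, p) > 0:
        p += 1
    return k + p - 1

def kk_total(n, k):
    """Proposition 6: sum of the KK bounds over all p >= 1."""
    return sum(kk_bound(n, k, p) for p in range(1, mu(n, k) - k + 1))

print(mu(13, 3), mu_from_kk(13, 3), kk_total(13, 3))
```

For 13 patterns of size 3 this gives a maximal candidate size of 5 and at most 6 + 1 = 7 candidates in total.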

4 Improved upper bounds 

The upper bound $KK$ in itself is neat and simple as it takes as parameters
only two numbers: the current size $k$, and the number $|L|$ of current
frequent patterns. However, in reality, when we have arrived at a certain
level $k$, we do not merely have the cardinality: we have the actual set $L$
of current $k$-patterns! For example, if the frequent patterns in the
current pass are all disjoint, our current upper bound will still estimate
the number of candidates to be a certain non-zero figure. However, by the
pairwise disjointness, it is clear that no further patterns will be possible
at all. In sum, because we have richer information than a mere cardinality,
we should be able to get a better upper bound.

To get inspiration, let us recall that the candidate generation process of
the Apriori algorithm works in two steps. In the join step, we join $L$ with
itself to obtain a superset of $C_{k+1}$. The union $p \cup q$ of two
patterns $p, q \in L$ is inserted in $C_{k+1}$ if they share their $k - 1$
smallest items:

    insert into C_{k+1}
    select p[1], p[2], ..., p[k], q[k]
    from L_k p, L_k q
    where p[1] = q[1], ..., p[k-1] = q[k-1], p[k] < q[k]

Next, in the prune step, we delete every pattern $c \in C_{k+1}$ such that
some $k$-subset of $c$ is not in $L$.
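The two steps can be sketched in a few lines. This is an illustrative version only (the function name `apriori_gen` and the tuple representation are assumptions): patterns are stored as sorted tuples so that "the $k - 1$ smallest items" is simply the tuple prefix.

```python
from itertools import combinations

def apriori_gen(L):
    """Join and prune steps on a non-empty family L of k-patterns."""
    L = sorted(tuple(sorted(s)) for s in L)
    members = set(L)
    k = len(L[0])
    # Join step: pair up patterns that agree on their k-1 smallest items.
    joined = [p + (q[-1],) for p in L for q in L
              if p[:-1] == q[:-1] and p[-1] < q[-1]]
    # Prune step: drop c if some k-subset of c is missing from L.
    pruned = [c for c in joined
              if all(s in members for s in combinations(c, k))]
    return joined, pruned

# The 13 patterns of size 3 from Example 1.
L = [{3, 2, 1}, {4, 2, 1}, {4, 3, 1}, {4, 3, 2},
     {5, 2, 1}, {5, 3, 1}, {5, 3, 2}, {5, 4, 1}, {5, 4, 2}, {5, 4, 3},
     {6, 2, 1}, {6, 3, 1}, {6, 3, 2}]
joined, pruned = apriori_gen(L)
print(len(joined), len(pruned))
```

On the family of Example 1 the join step produces 12 sets, of which 6 survive the prune step.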

Let us now take a closer look at the join step from another point of view.
Consider a family of all frequent patterns of size $k$ that share their
$k - 1$ smallest items, and let its cardinality be $n$. If we now remove
from each of these patterns all these shared $k - 1$ smallest items, we get
exactly $n$ distinct single-item patterns. The number of pairs that can be
formed from these single items, being $\binom{n}{2}$, is exactly the number
of candidates the join step will generate for the family under
consideration. We thus get an obvious upper bound on the total number of
candidates by taking the sum of all $\binom{n_f}{2}$, for every possible
family $f$.

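This obvious upper bound on $|C_{k+1}|$, which we denote by
$\mathit{obvious}_{k+1}(L)$, can be recursively computed in the following
manner. Let $I$ denote the set of items occurring in $L$. For an arbitrary
item $x$, define the set $L^x$ as

\[
L^x = \{ s - \{x\} \mid s \in L \text{ and } x = \min s \}.
\]

Then

\[
\mathit{obvious}_{k+1}(L) :=
\begin{cases}
  \binom{|L|}{2} & \text{if } k = 1; \\
  \sum_{x \in I} \mathit{obvious}_k(L^x) & \text{if } k > 1.
\end{cases}
\]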
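A direct transcription of this recursion might look as follows. The function name `obvious` is a hypothetical stand-in for $\mathit{obvious}_{k+1}$, and items are assumed to be comparable (here, integers) so that `min` picks the smallest item of a pattern.

```python
from math import comb

def obvious(L):
    """obvious_{k+1}(L) for a family L of distinct k-patterns."""
    L = [frozenset(s) for s in L]
    if not L:
        return 0
    k = len(L[0])
    if k == 1:
        return comb(len(L), 2)
    items = set().union(*L)
    # L^x: patterns whose smallest item is x, with x removed.
    return sum(obvious([s - {x} for s in L if min(s) == x]) for x in items)

# The 13 patterns of size 3 from Example 1.
L = [{3, 2, 1}, {4, 2, 1}, {4, 3, 1}, {4, 3, 2},
     {5, 2, 1}, {5, 3, 1}, {5, 3, 2}, {5, 4, 1}, {5, 4, 2}, {5, 4, 3},
     {6, 2, 1}, {6, 3, 1}, {6, 3, 2}]
print(obvious(L))
```

On the family of Example 1 this evaluates to 12, the number of sets produced by the join step, whereas only 6 candidates remain after pruning.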

This upper bound is much too crude, however, because it does not take
the prune step into account, only the join step. The join step only checks
two $k$-subsets of a potential candidate instead of all $k + 1$ $k$-subsets.

However, we can generalize this method such that more subsets will be
considered. Indeed, instead of taking a family of all frequent patterns
sharing their $k - 1$ smallest items, we can take all frequent patterns
sharing only their $k'$ smallest items, for some $k' \leq k - 1$. If we then
remove these $k'$ shared items from each pattern in the family, we get a new
set $L'$ of $n$ patterns of size $k - k'$. If we now consider the set $C'$
of candidates (of size $k - k' + 1$) for $L'$, and add back to each of them
the previously removed $k'$ items, we obtain a pruned set of candidates of
size $k + 1$, where instead of just two (as in the join step), $k - k' + 1$
of the $k$-subsets were checked in the pruning. Note that we can get the
estimate $KK_{k-k'}^{k-k'+1}(|L'|)$ on the cardinality of $C'$ from our
upper bound in Theorem 2.

Doing this for all possible values of $k'$ yields an improved upper bound
on $|C_{k+1}|$, which we denote by $\mathit{improved}_{k+1}(L)$, and which
is computed by refining the recursive procedure for the obvious upper bound
as follows:

\[
\mathit{improved}_{k+1}(L) :=
\begin{cases}
  \binom{|L|}{2} & \text{if } k = 1; \\
  \min\{ KK_k^{k+1}(|L|), \sum_{x \in I} \mathit{improved}_k(L^x) \}
    & \text{if } k > 1.
\end{cases}
\]

Actually, as in the previous section, we can do this not only to estimate
$|C_{k+1}|$, but also more generally to estimate $|C_{k+p}|$ for any
$p \geq 1$. Henceforth we will denote our general improved upper bound by
$KK^*_{k+p}(L)$. The general definition is as follows:

\[
KK^*_{k+p}(L) :=
\begin{cases}
  KK_k^{k+p}(|L|) & \text{if } k = 1; \\
  \min\{ KK_k^{k+p}(|L|), \sum_{x \in I} KK^*_{k+p-1}(L^x) \}
    & \text{if } k > 1.
\end{cases}
\]

(For the base case, note that $KK_k^{k+p}(|L|)$, when $k = 1$, is nothing
but $\binom{|L|}{p+1}$.)

By definition, $KK^*_{k+p}$ never exceeds $KK_k^{k+p}$. We now prove
formally that it is still an upper bound on the number of candidate patterns
of size $k + p$:


Theorem 7.

\[
|C_{k+p}(L)| \leq KK^*_{k+p}(L).
\]

Proof. By induction on $k$. The base case $k = 1$ is clear. For $k > 1$, it
suffices to show that for all $p > 0$

\[
C_{k+p}(L) \subseteq \bigcup_{x \in I} C_{k+p-1}(L^x) + x. \tag{1}
\]

(For any set of patterns $H$, we denote $\{h \cup \{x\} \mid h \in H\}$ by
$H + x$.) From the above containment we can conclude

\begin{align*}
|C_{k+p}(L)| &\leq \Bigl| \bigcup_{x \in I} C_{k+p-1}(L^x) + x \Bigr| \\
             &\leq \sum_{x \in I} |C_{k+p-1}(L^x) + x| \\
             &= \sum_{x \in I} |C_{k+p-1}(L^x)| \\
             &\leq \sum_{x \in I} KK^*_{k+p-1}(L^x),
\end{align*}

where the last inequality is by induction.

To show (1), we need to show that for every $p > 0$ and every
$s \in C_{k+p}(L)$, $s - \{x\} \in C_{k+p-1}(L^x)$, where $x = \min s$. This
means that every subset of $s - \{x\}$ of size $k - 1$ must be an element of
$L^x$. Let $s - \{x\} - \{y_1, \ldots, y_p\}$ be such a subset. This subset
is an element of $L^x$ iff $s - \{y_1, \ldots, y_p\} \in L$ and
$x = \min(s - \{y_1, \ldots, y_p\})$. The first condition follows from
$s \in C_{k+p}(L)$, and the second condition is trivial. Hence the
theorem. □

A natural question is why we must take the minimum in the definition
of $KK^*$. The answer is that the two terms of which we take the minimum
are incomparable. The example of an $L$ where all patterns are pairwise
disjoint, already mentioned in the beginning of this section, shows that,
for example, $KK_k^{k+1}(|L|)$ can be larger than the summation
$\sum_{x \in I} KK^*_k(L^x)$. But the converse is also possible: consider
$L = \{\{1,2\}, \{1,3\}\}$. Then $KK_2^3(|L|) = 0$, but the summation
yields 1.

Example 2. Let $L$ consist of $\{5,7,8\}$ and $\{5,8,9\}$ plus all 19
3-subsets of $\{1,2,3,4,5\}$ and $\{3,4,5,6,7\}$. Because
$21 = \binom{6}{3} + \binom{2}{2}$, we have $KK_3^4(21) = 15$,
$KK_3^5(21) = 6$ and $KK_3^6(21) = 1$. On the other hand,

\begin{align*}
KK^*_4(L) &= KK^*_3(L^1) + KK^*_3(L^2) + KK^*_3(L^3) + KK^*_3(L^4) \\
  &\quad + KK^*_2((L^5)^6) + KK^*_2((L^5)^7) + KK^*_2((L^5)^8)
         + KK^*_2((L^5)^9) \\
  &\quad + KK^*_3(L^6) + KK^*_3(L^7) + KK^*_3(L^8) + KK^*_3(L^9) \\
  &= 4 + 1 + 4 + 1 + 0 + \cdots + 0 \\
  &= 10
\end{align*}

and

\begin{align*}
KK^*_5(L) &= KK^*_4(L^1) + KK^*_4(L^2) + KK^*_4(L^3) + KK^*_4(L^4) \\
  &\quad + KK^*_3((L^5)^6) + KK^*_3((L^5)^7) + KK^*_3((L^5)^8)
         + KK^*_3((L^5)^9) \\
  &\quad + KK^*_4(L^6) + KK^*_4(L^7) + KK^*_4(L^8) + KK^*_4(L^9) \\
  &= 1 + 0 + 1 + 0 + 0 + \cdots + 0 \\
  &= 2.
\end{align*}

Indeed, we have 10 4-subsets of $\{1,2,3,4,5\}$ and $\{3,4,5,6,7\}$, and the
two 5-sets themselves.
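The numbers in this example can be reproduced with a small recursive sketch of the improved bound. The function names (`canonical_representation`, `kk_bound`, `kk_star`) are hypothetical; `kk_star(L, p)` is meant to compute $KK^*_{k+p}(L)$ for a non-empty family $L$ of $k$-patterns, with $L^x$ obtained by removing the smallest item $x$ from the patterns that start with it.

```python
from itertools import combinations
from math import comb

def canonical_representation(n, k):
    """Greedy k-canonical representation of n as a list of (m_i, i) pairs."""
    terms = []
    i = k
    while n > 0 and i >= 1:
        m = i
        while comb(m + 1, i) <= n:
            m += 1
        terms.append((m, i))
        n -= comb(m, i)
        i -= 1
    return terms

def kk_bound(n, k, p):
    """Basic KK bound for size k + p, given n patterns of size k."""
    return sum(comb(m, i + p) for m, i in canonical_representation(n, k))

def kk_star(L, p):
    """Improved bound for size k + p on a non-empty set L of frozensets."""
    k = len(next(iter(L)))
    basic = kk_bound(len(L), k, p)
    if k == 1:
        return basic
    refined = 0
    for x in set().union(*L):
        Lx = {s - {x} for s in L if min(s) == x}
        if Lx:                       # an empty L^x contributes nothing
            refined += kk_star(Lx, p)
    return min(basic, refined)

# The family of Example 2: 19 + 2 = 21 patterns of size 3.
L = {frozenset(c) for c in combinations([1, 2, 3, 4, 5], 3)}
L |= {frozenset(c) for c in combinations([3, 4, 5, 6, 7], 3)}
L |= {frozenset({5, 7, 8}), frozenset({5, 8, 9})}

print([kk_bound(len(L), 3, p) for p in (1, 2, 3)])
print([kk_star(L, p) for p in (1, 2, 3)])
```

The recursion is exponential in neither the number of items nor the number of patterns: every pattern is visited once per recursion depth, so the cost is roughly $k$ passes over $L$ plus the canonical representations.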

We can also improve the upper bound $\mu_k(|L|)$ on maxsize($L$). In
analogy with Proposition 5, we define:

\[
\mu^*_k(L) := k + \min\{p \mid KK^*_{k+p}(L) = 0\} - 1.
\]

We then have:

Proposition 8.

\[
\text{maxsize}(L) \leq \mu^*_k(L) \leq \mu_k(|L|).
\]

We finally use Theorem 7 for improving the upper bound
$KK_k^{\mathrm{total}}$ on the total number of candidate patterns. We
define:

\[
KK^*_{\mathrm{total}}(L) := \sum_{p \geq 1} KK^*_{k+p}(L).
\]

Then we have:

Proposition 9. The total number of candidate patterns that can be generated
from a set $L$ of $k$-patterns is bounded by $KK^*_{\mathrm{total}}(L)$.
Moreover,

\[
KK^*_{\mathrm{total}}(L) \leq KK_k^{\mathrm{total}}(|L|).
\]


5 Generalized upper bounds 


The upper bounds presented in the previous sections work well for algorithms
that generate and test candidate patterns of one specific size at a time.
However, a lot of algorithms generate and test patterns of different sizes
within the same pass of the algorithm. Hence, these algorithms know in
advance whether several patterns of size larger than $k$ are frequent or
not. Since our upper bound is solely based on the patterns of a certain
length $k$, it does not use information about patterns of length larger
than $k$.

Nevertheless, these larger sets could give crucial information. More
specifically, suppose we have generated all frequent patterns of size $k$,
and we also already know in advance that a certain set of size larger than
$k$ is not frequent. Our upper bound on the total number of candidate
patterns that can still be generated would disregard this information. We
will therefore generalize our upper bound such that it will also incorporate
this additional information.


5.1 Generalized KK-bound

From now on, $L$ is some family of sets of patterns
$L_k, L_{k+1}, \ldots, L_{k+q}$ which are known to be frequent, such that
$L_{k+p}$ contains patterns of size $k + p$, and all $(k+p-1)$-subsets of
all patterns in $L_{k+p}$ are in $L_{k+p-1}$. We denote by $|L|$ the
sequence of numbers $|L_k|, |L_{k+1}|, \ldots, |L_{k+q}|$.

Similarly, let $I$ be a family of sets of patterns
$I_k, I_{k+1}, \ldots, I_{k+q}$ which are known to be infrequent, such that
$I_{k+p}$ contains patterns of size $k + p$ and all $(k+p-1)$-subsets of all
patterns in $I_{k+p}$ are in $L_{k+p-1}$. We denote by $|I|$ the sequence of
numbers $|I_k|, |I_{k+1}|, \ldots, |I_{k+q}|$. Note that for each
$p \geq 0$, $L_{k+p}$ and $I_{k+p}$ are disjoint.

Before we present the general upper bounds, we also generalize our notion 
of a candidate pattern. 

Definition 2. A candidate pattern for $(L, I)$ of size $k + p$ is a pattern
which is not in $L_{k+p}$ or $I_{k+p}$, all of its $k$-subsets are in
$L_k$, and none of its subsets of size larger than $k$ is included in
$I_k \cup I_{k+1} \cup \cdots \cup I_{k+q}$. For a given $p$, we denote the
set of all $(k+p)$-size candidate patterns for $(L, I)$ by $C_{k+p}(L, I)$.

We note: 


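Lemma 10.

\[
C_{k+p}(L, I) =
\begin{cases}
  C_{k+1}(L_k) \setminus (L_{k+1} \cup I_{k+1}) & \text{if } p = 1; \\
  C_{k+p}\bigl(C_{k+p-1}(L, I) \cup L_{k+p-1}\bigr)
    \setminus (L_{k+p} \cup I_{k+p}) & \text{if } p > 1.
\end{cases}
\]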


Proof. The case $p = 1$ is clear. For $p > 1$, we show the inclusion in both
directions.

$\supseteq$ For every set in $C_{k+p}(C_{k+p-1}(L, I) \cup L_{k+p-1})$, we
   know that all of its $k$-subsets are always contained in a
   $(k+p-1)$-subset, and these are in $C_{k+p-1}(L, I) \cup L_{k+p-1}$. By
   definition, we know that for every set in $C_{k+p-1}(L, I)$, all of its
   $k$-subsets are in $L_k$. Also, for every set in $L_{k+p-1}$, all of its
   $k$-subsets are in $L_k$. By definition, for every set in
   $C_{k+p-1}(L, I)$, all of its $(k+p-i)$-subsets are not in $I_{k+p-i}$.
   Also, for every set in $L_{k+p-1}$, all of its $(k+p-i)$-subsets are in
   $L_{k+p-i}$ and hence they are not in $I_{k+p-i}$ since they are
   disjoint. By definition, none of the patterns in $L_{k+p} \cup I_{k+p}$
   are in $C_{k+p}(L, I)$.

$\subseteq$ It suffices to show that for every set in $C_{k+p}(L, I)$, every
   $(k+p-1)$-subset $s$ is in $C_{k+p-1}(L, I) \cup L_{k+p-1}$. Obviously,
   this is true, since if it is not already in $L_{k+p-1}$, still all
   $k$-subsets of $s$ must be in $L_k$, $s$ cannot be in $I_{k+p-1}$ and
   none of its subsets can be in any $I_{k+p-i}$ with $i > 1$.

□
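On a small instance, Lemma 10 can be checked by computing $C_{k+p}(L, I)$ in two ways: directly from Definition 2, and through the recurrence. The sketch below is a brute-force illustration only; the function names and the dictionary layout (pattern size mapped to a set of frozensets) are assumptions made for this example. The instance takes $L_3$ to be all 3-subsets of $\{1, \ldots, 6\}$ and $I_4 = \{\{1,2,3,4\}, \{3,4,5,6\}\}$.

```python
from itertools import combinations

def all_subsets_in(c, size, family):
    """True if every subset of c of the given size belongs to family."""
    return all(frozenset(s) in family for s in combinations(sorted(c), size))

def direct(L, I, k, p, items):
    """C_{k+p}(L, I) computed straight from Definition 2."""
    infrequent = set().union(*I.values())
    out = set()
    for c in map(frozenset, combinations(items, k + p)):
        if c in L.get(k + p, set()) or c in I.get(k + p, set()):
            continue
        if not all_subsets_in(c, k, L[k]):
            continue
        if any(frozenset(s) in infrequent
               for j in range(k + 1, k + p + 1)
               for s in combinations(sorted(c), j)):
            continue
        out.add(c)
    return out

def via_lemma(L, I, k, p, items):
    """C_{k+p}(L, I) computed with the recurrence of Lemma 10."""
    current = None
    for j in range(1, p + 1):
        base = L[k] if j == 1 else current | L.get(k + j - 1, set())
        current = {c for c in map(frozenset, combinations(items, k + j))
                   if all_subsets_in(c, k + j - 1, base)}
        current -= L.get(k + j, set()) | I.get(k + j, set())
    return current

items = range(1, 7)
L = {3: {frozenset(c) for c in combinations(items, 3)}}
I = {4: {frozenset({1, 2, 3, 4}), frozenset({3, 4, 5, 6})}}
for p in (1, 2, 3):
    print(p, len(direct(L, I, 3, p, items)), len(via_lemma(L, I, 3, p, items)))
```

Both computations agree on this instance and yield 13, 2 and 0 candidate patterns of sizes 4, 5 and 6.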


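Hence, we define

\[
gKK_k^{k+p}(|L|, |I|) :=
\begin{cases}
  KK_k^{k+1}(|L_k|) - |L_{k+1}| - |I_{k+1}| & \text{if } p = 1; \\
  KK_{k+p-1}^{k+p}\bigl(gKK_k^{k+p-1}(|L|, |I|) + |L_{k+p-1}|\bigr)
    - |L_{k+p}| - |I_{k+p}| & \text{if } p > 1,
\end{cases}
\]

and obtain:

Theorem 11.

\[
|C_{k+p}(L, I)| \leq gKK_k^{k+p}(|L|, |I|)
  \leq KK_k^{k+p}(|L_k|) - |L_{k+p}| - |I_{k+p}|.
\]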

Proof. The first inequality is clear by Lemma 10. The second inequality is
by induction on $p$. The base case $p = 1$ is by definition. For $p > 1$, we
have:

\begin{align*}
gKK_k^{k+p}(|L|, |I|)
  &= KK_{k+p-1}^{k+p}\bigl(gKK_k^{k+p-1}(|L|, |I|) + |L_{k+p-1}|\bigr)
     - |L_{k+p}| - |I_{k+p}| \\
  &\leq KK_{k+p-1}^{k+p}\bigl(KK_k^{k+p-1}(|L_k|) - |L_{k+p-1}|
     - |I_{k+p-1}| + |L_{k+p-1}|\bigr) - |L_{k+p}| - |I_{k+p}| \\
  &\leq KK_{k+p-1}^{k+p}\bigl(KK_k^{k+p-1}(|L_k|)\bigr)
     - |L_{k+p}| - |I_{k+p}| \\
  &= KK_k^{k+p}(|L_k|) - |L_{k+p}| - |I_{k+p}|,
\end{align*}

where the first inequality is by induction and because of the monotonicity
of $KK$, the second inequality also because of the monotonicity of $KK$, and
the last equality follows from

\[
KK_{k+p-1}^{k+p}\bigl(KK_k^{k+p-1}(|L_k|)\bigr) = KK_k^{k+p}(|L_k|).
\]

□

Again, we can also generalize the upper bound on the maximal size of a
candidate pattern, denoted by maxsize($L, I$), and the upper bound on the
total number of candidate patterns, both also incorporating $(L, I)$:

\begin{align*}
g\mu_k(|L|, |I|) &:= k + \min\{p \mid gKK_k^{k+p}(|L|, |I|) = 0\} - 1, \\
gKK_k^{\mathrm{total}}(|L|, |I|) &:= \sum_{p \geq 1} gKK_k^{k+p}(|L|, |I|).
\end{align*}

We obtain: 

Proposition 12.

\[
\text{maxsize}(L, I) \leq g\mu_k(|L|, |I|) \leq \mu_k(|L_k|).
\]

Proposition 13. The total number of candidate patterns that can be
generated from $(L, I)$ is bounded by $gKK_k^{\mathrm{total}}(|L|, |I|)$.
Moreover,

\[
gKK_k^{\mathrm{total}}(|L|, |I|) \leq KK_k^{\mathrm{total}}(|L_k|).
\]

Example 3. Suppose $L_3$ consists of all subsets of size 3 of the set
$\{1,2,3,4,5,6\}$. Now assume we already know that $I_4$ contains patterns
$\{1,2,3,4\}$ and $\{3,4,5,6\}$. The $KK$ upper bound presented in the
previous section would estimate the number of candidate patterns of sizes 4,
5, and 6 to be at most $\binom{6}{4} = 15$, $\binom{6}{5} = 6$, and
$\binom{6}{6} = 1$ respectively. Nevertheless, using the additional
information, $gKK$ can already reduce these numbers to 13, 3, and 0. Also,
$\mu$ would predict the maximal size of a candidate pattern to be 6, while
$g\mu$ can already predict this number to be at most 5. Similarly,
$KK^{\mathrm{total}}$ would predict the total number of candidate patterns
that can still be generated to be at most 22, while $gKK^{\mathrm{total}}$
can already deduce this number to be at most 16.
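The figures of this example follow from the recursive definition of $gKK$. The sketch below is illustrative only: the function name `gkk` and the use of dictionaries mapping a pattern size to $|L_j|$ or $|I_j|$ (missing sizes counting as 0) are assumptions of this example.

```python
from math import comb

def canonical_representation(n, k):
    """Greedy k-canonical representation of n as a list of (m_i, i) pairs."""
    terms = []
    i = k
    while n > 0 and i >= 1:
        m = i
        while comb(m + 1, i) <= n:
            m += 1
        terms.append((m, i))
        n -= comb(m, i)
        i -= 1
    return terms

def kk_bound(n, k, p):
    """Basic KK bound for size k + p; returns 0 when n is 0."""
    return sum(comb(m, i + p) for m, i in canonical_representation(n, k))

def gkk(L_sizes, I_sizes, k, p):
    """Generalized bound for size k + p from the cardinality sequences."""
    bound = kk_bound(L_sizes[k], k, 1)
    bound -= L_sizes.get(k + 1, 0) + I_sizes.get(k + 1, 0)
    for j in range(2, p + 1):
        # One level up: feed the previous bound plus the known frequent sets.
        n = bound + L_sizes.get(k + j - 1, 0)
        bound = kk_bound(n, k + j - 1, 1)
        bound -= L_sizes.get(k + j, 0) + I_sizes.get(k + j, 0)
    return bound

L_sizes = {3: 20}   # all 3-subsets of a 6-item set
I_sizes = {4: 2}    # two 4-sets known to be infrequent
bounds = [gkk(L_sizes, I_sizes, 3, p) for p in (1, 2, 3)]
print(bounds, sum(bounds))
```

The first zero appears at $p = 3$, which gives the bound $3 + 3 - 1 = 5$ on the maximal candidate size.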

5.2 Generalized KK*-bound

Using the generalized basic upper bound, we can now also generalize our
improved upper bound $KK^*$. For an arbitrary item $x$, define the family of
sets $L^x$ as $L^x_k, L^x_{k+1}, \ldots, L^x_{k+q}$ and $I^x$ as
$I^x_k, I^x_{k+1}, \ldots, I^x_{k+q}$. We define:

\[
gKK^*_{k+p}(L, I) :=
\begin{cases}
  gKK_k^{k+p}(|L|, |I|) & \text{if } k = 1; \\
  \min\{ gKK_k^{k+p}(|L|, |I|), \sum_{x \in I} gKK^*_{k+p-1}(L^x, I^x) \}
    & \text{if } k > 1.
\end{cases}
\]

We then have:

Theorem 14.

\[
|C_{k+p}(L, I)| \leq gKK^*_{k+p}(L, I)
  \leq KK^*_{k+p}(L_k) - |L_{k+p}| - |I_{k+p}|.
\]

Proof. The proof of the first inequality is similar to the proof of
Theorem 7, except that we now need to show that for all $p > 0$,

\[
C_{k+p}(L, I) \subseteq \bigcup_{x \in I} C_{k+p-1}(L^x, I^x) + x.
\]

Therefore, we need to show for every $s \in C_{k+p}(L, I)$ that
$s - \{x\} \in C_{k+p-1}(L^x, I^x)$, where $x = \min s$. First, this means
that every subset of $s - \{x\}$ of size $k - 1$ must be in $L^x_k$. Let
$s - \{x\} - \{y_1, \ldots, y_p\}$ be such a subset. This subset is an
element of $L^x_k$ if and only if $s - \{y_1, \ldots, y_p\} \in L_k$ and
$x = \min(s - \{y_1, \ldots, y_p\})$. The first condition follows from
$s \in C_{k+p}(L, I)$, and the second condition is trivial. Second, we need
to show that $s - \{x\}$ is not in $L^x_{k+p}$. Since
$s \in C_{k+p}(L, I)$, $s$ is not in $L_{k+p}$ and hence $s - \{x\}$ cannot
be in $L^x_{k+p}$. Finally, we need to show that none of the subsets of
$s - \{x\}$ of size greater than $k - 1$ are in
$I^x_k, \ldots, I^x_{k+p-1}$. Let $s - \{x\} - \{y_1, \ldots, y_m\}$ be such
a subset. Since $s \in C_{k+p}(L, I)$, $s - \{y_1, \ldots, y_m\}$ is not in
$I_{k+p-m}$, and hence $s - \{x\} - \{y_1, \ldots, y_m\}$ cannot be in
$I^x_{k+p-m}$.

We prove the second inequality by induction on $k$. The base case $k = 1$
is clear. For all $k > 1$, we have

\begin{align*}
gKK^*_{k+p}(L, I)
  &= \min\Bigl\{ gKK_k^{k+p}(|L|, |I|),
       \sum_{x \in I} gKK^*_{k+p-1}(L^x, I^x) \Bigr\} \\
  &\leq \min\Bigl\{ KK_k^{k+p}(|L_k|) - |L_{k+p}| - |I_{k+p}|,
       \sum_{x \in I} \bigl( KK^*_{k+p-1}(L^x_k) - |L^x_{k+p}|
       - |I^x_{k+p}| \bigr) \Bigr\} \\
  &= \min\Bigl\{ KK_k^{k+p}(|L_k|),
       \sum_{x \in I} KK^*_{k+p-1}(L^x_k) \Bigr\}
     - |L_{k+p}| - |I_{k+p}| \\
  &= KK^*_{k+p}(L_k) - |L_{k+p}| - |I_{k+p}|,
\end{align*}

where the left hand side of the minimum in the inequality is by Theorem 11
and the right hand side is by induction. □


Again, we get an upper bound on maxsize($L, I$):

\[
g\mu^*_k(L, I) := k + \min\{p \mid gKK^*_{k+p}(L, I) = 0\} - 1,
\]

and on the total number of candidate patterns that can still be generated:

\[
gKK^*_{\mathrm{total}}(L, I) := \sum_{p \geq 1} gKK^*_{k+p}(L, I).
\]

We then have the following propositions, analogous to Propositions 8 and 9:

Proposition 15.

\[
\text{maxsize}(L, I) \leq g\mu^*_k(L, I) \leq \mu^*_k(L_k).
\]

Proposition 16. The total number of candidate patterns that can be
generated from $(L, I)$ is bounded by $gKK^*_{\mathrm{total}}(L, I)$.
Moreover,

\[
gKK^*_{\mathrm{total}}(L, I) \leq KK^*_{\mathrm{total}}(L_k).
\]

Example 4. Consider the same set of patterns as in the previous example,
i.e., $L_3$ consists of all subsets of size 3 of the set
$\{1, 2, 3, 4, 5, 6\}$, and $\{1, 2, 3, 4\}$ and $\{3, 4, 5, 6\}$ are
included in $I_4$. The $KK^*$ upper bound presented in the previous section
would also estimate the number of candidate patterns of sizes 4, 5, and 6 to
be at most $\binom{6}{4} = 15$, $\binom{6}{5} = 6$, and $\binom{6}{6} = 1$
respectively. Nevertheless, using the additional information, $gKK^*$ can
perfectly predict these numbers to be 13, 2, and 0. Again, $\mu^*$ would
predict the maximal size of a candidate pattern to be 6, while $g\mu^*$ can
already predict this number to be at most 5. Similarly,
$KK^*_{\text{total}}$ would predict the total number of candidate patterns
that can still be generated to be at most 22, while
$gKK^*_{\text{total}}$ can already deduce this number to be at most 15.
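
The numbers in Example 4 can be checked by brute force. The following Python
sketch is only an illustration and not part of the implementation evaluated
in this paper. The function name `candidates` and the reading of the
candidate definition it encodes (every subset of size 3 must be in $L_3$ and
no subset may be a known infrequent pattern) are assumptions made for this
sketch.

```python
from itertools import combinations

def candidates(items, frequent_k, infrequent, size):
    """Enumerate candidate patterns of the given size by brute force.

    Assumed reading of the definition: a set qualifies when each of its
    k-subsets is in frequent_k and none of its subsets is in infrequent.
    """
    k = len(next(iter(frequent_k)))
    found = []
    for c in combinations(sorted(items), size):
        all_k_subsets_frequent = all(
            frozenset(sub) in frequent_k for sub in combinations(c, k)
        )
        contains_infrequent = any(bad <= set(c) for bad in infrequent)
        if all_k_subsets_frequent and not contains_infrequent:
            found.append(c)
    return found

items = range(1, 7)
L3 = {frozenset(c) for c in combinations(items, 3)}
I4 = {frozenset({1, 2, 3, 4}), frozenset({3, 4, 5, 6})}

counts = [len(candidates(items, L3, I4, size)) for size in (4, 5, 6)]
print(counts, sum(counts))  # [13, 2, 0] 15
```

With an empty set of infrequent patterns the same enumeration yields 15, 6,
and 1, which are the values the bounds without the additional information
predict in this example.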


6 Efficient Implementation 


For simplicity, we restrict ourselves to explaining how the improved upper
bounds can be implemented. The proposed implementation can easily be
extended to support the computation of the general upper bounds.

To evaluate our upper bounds we implemented an optimized version of the
Apriori algorithm using a trie data structure to store all generated patterns,
similar to the one described by Brin et al. [11]. This trie structure makes it
cheap and straightforward to implement the computation of all upper bounds.
Indeed, a top-level subtrie (rooted at some singleton pattern $\{x\}$)
represents exactly the set $L^x$ defined earlier. Every top-level subtrie of
this subtrie (rooted at some two-element pattern $\{x, y\}$) then represents
$(L^x)^y$, and so on. Hence, we can compute the recursive bounds while
traversing the trie, after the frequencies of all candidate patterns have
been counted and the trie has to be traversed once more to remove all
candidate patterns that turned out to be infrequent. This can be done as
follows.

Remember that, at that point, the current set of frequent patterns of
size $k$ is stored in the trie. For every node at depth $d$ smaller than $k$,
we compute the $(k - d)$-canonical representation of the number of
descendants this node has at depth $k$, which can be used to compute
$\mu_{k-d}$, $KK^{\ell}_{k-d}$ for any $\ell \le \mu_{k-d}$, and hence also
$KK_{\text{total}}$ (cf. the corresponding theorem and propositions of the
previous sections).
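
To make the role of the canonical representation concrete, here is a minimal
Python sketch. It assumes that the bound takes the standard Kruskal-Katona
form: if $n = \binom{m_k}{k} + \binom{m_{k-1}}{k-1} + \cdots$ is the
$k$-canonical representation of $n$, the bound on the number of sets of size
$k + p$ is $\binom{m_k}{k+p} + \binom{m_{k-1}}{k+p-1} + \cdots$. The function
names `cascade`, `kk`, and `mu_and_total` are hypothetical and chosen for
this illustration only.

```python
from math import comb

def cascade(n, k):
    """k-canonical representation of n as [(m_k, k), (m_{k-1}, k-1), ...]."""
    terms = []
    i = k
    while n > 0 and i >= 1:
        if i == 1:
            m = n  # comb(m, 1) == m, so the last term absorbs the remainder
        else:
            m = i  # comb(i, i) == 1 <= n
            while comb(m + 1, i) <= n:
                m += 1
        terms.append((m, i))
        n -= comb(m, i)
        i -= 1
    return terms

def kk(n, k, p):
    """Bound on the number of (k+p)-sets, given n sets of size k."""
    return sum(comb(m, i + p) for m, i in cascade(n, k))

def mu_and_total(n, k):
    """Bounds on the maximal candidate size and on the total candidate count."""
    total, p = 0, 1
    while (bound := kk(n, k, p)) > 0:
        total += bound
        p += 1
    # p is now the smallest value for which the bound is zero
    return k + p - 1, total

print(cascade(19, 3))                         # [(5, 3), (4, 2), (3, 1)]
print([kk(20, 3, p) for p in (1, 2, 3, 4)])   # [15, 6, 1, 0]
print(mu_and_total(20, 3))                    # (6, 22)
```

Python integers have arbitrary length, so no separate big-number library is
needed in this sketch.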

For every node at depth $k - 1$, its $KK^*$ and $\mu^*$ values are equal to
its $KK$ and $\mu$ values respectively. For every other node, compute for
every $p > 0$ the sum of the $KK^*$ values for that $p$ of all its children,
and let the node's own $KK^*$ value for $p$ be the smaller of this sum and
the node's $KK$ value for $p$, until this minimum becomes zero, which also
gives us the value of $\mu^*$. Finally, we can compute $KK^*_{\text{total}}$
for this node. If this is done for every node, traversed in a depth-first
manner, then the root node will finally contain the upper bounds on the
number of candidate patterns that can still be generated, and on the maximum
size of any such pattern. The soundness and completeness of this method
follow directly from the theorems and propositions of the previous sections.
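
The traversal described above can be sketched in a few lines of Python. This
is an illustration under assumptions, not the trie implementation used in
our experiments: patterns are given as sorted tuples, grouping them by their
smallest item stands in for descending into a top-level subtrie, and the
names `kk` and `kk_star` are hypothetical. The helper `kk` assumes the
standard Kruskal-Katona form of the bound.

```python
from collections import defaultdict
from itertools import combinations
from math import comb

def kk(n, k, p):
    """Kruskal-Katona bound via the k-canonical representation of n."""
    bound, i = 0, k
    while n > 0 and i >= 1:
        m = n if i == 1 else i
        while i > 1 and comb(m + 1, i) <= n:
            m += 1
        bound += comb(m, i + p)
        n -= comb(m, i)
        i -= 1
    return bound

def kk_star(patterns, p):
    """Recursive bound: the smaller of the node's own KK value and the
    sum of the values of its child subtries.

    patterns: non-empty collection of sorted tuples of equal length.
    """
    patterns = list(patterns)
    k = len(patterns[0])
    own = kk(len(patterns), k, p)
    if k == 1 or own == 0:
        return own
    children = defaultdict(list)
    for s in patterns:
        children[s[0]].append(s[1:])  # subtrie rooted at the smallest item
    return min(own, sum(kk_star(group, p) for group in children.values()))

L3 = list(combinations(range(1, 7), 3))
print([kk_star(L3, p) for p in (1, 2, 3)])  # [15, 6, 1]

sparse = [(1, 2), (3, 4), (5, 6)]
print(kk(len(sparse), 2, 1), kk_star(sparse, 1))  # 1 0
```

The second call shows the effect of using the structure of the collection:
three disjoint pairs admit no candidate of size 3, which the recursive bound
detects while the bound based on the count alone does not.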

We should also point out that, since the numbers involved can become
exponentially large (in the number of items), an implementation should take
care to use arbitrary-length integers such as provided by standard
mathematical packages. Since the length of an integer is only logarithmic in
its value, the lengths of the numbers involved will remain polynomially
bounded.


7 Experimental Evaluation 


All experiments were performed on a 400MHz Sun Ultra Sparc with 512 MB
main memory, running Sun Solaris 8. The algorithm was implemented in
C++ and uses the GNU MP library for arbitrary-length integers.


Data sets We have experimented using three real data sets, of which two
are publicly available, and one synthetic data set generated by the program
provided by the Quest research group at IBM Almaden. The mushroom
data set contains characteristics of various species of mushrooms, and was
originally obtained from the UCI repository of machine learning databases.
The BMS-WebView-1 data set contains several months' worth of clickstream
data from an e-commerce web site, and is made publicly available by Blue
Martini Software. The basket data set contains transactions from a
Belgian retail store, but can unfortunately not be made publicly available.
Table 1 shows the number of items and the number of transactions in each
data set. The table additionally shows the minimal support threshold we used
in our experiments for each data set, together with the resulting number of
iterations and the time (in seconds) which the Apriori algorithm needed to
find all frequent patterns.

Data set         #Items   #Transactions   MinSup   #Iterations   Time (s)
T40I10D100K        1000         100 000      700            18       1700
mushroom            120            8124      813            16        663
BMS-Webview-1       498          59 602       36            15         86
basket           13 103          41 373        5            11         43

Table 1: Database Characteristics


The results from the experiment with the real data sets were not immediately
as good as the results from the synthetic data set. The reason for this,
however, turned out to be the bad ordering of the items, as explained next.


Reordering From the form of $L^x$, it can be seen that the order of the items
can affect the recursive upper bounds. By computing the upper bound only
for a subset of all frequent patterns (namely $L^x$), we win by incorporating
the structure of the current collection of frequent patterns, but we also lose
some information. Indeed, whenever we recursively restrict ourselves to a
subtrie $L^x$, then for every candidate pattern $s$ with $x = \min s$, we lose
the information about exactly one subpattern in $L$, namely $s - \{x\}$.

We therefore would like to make it likely that many of these excluded
patterns are frequent. A good heuristic, which has already been used for
several other optimizations in frequent pattern mining, is to force the
most frequent items to appear in the most candidate patterns, by reordering
the single item patterns in increasing order of frequency.

After reordering the items in the real-life data sets using this heuristic,
the results became closely analogous to the results for the synthetic data
set.
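
A minimal Python sketch of this reordering step follows. It is an
illustration only; the function name `reorder_by_increasing_frequency`, the
representation of transactions as lists of items, and the tie-breaking rule
(by item label) are assumptions of this sketch.

```python
from collections import Counter

def reorder_by_increasing_frequency(transactions):
    """Relabel items so that less frequent items get smaller labels."""
    support = Counter(item for t in transactions for item in set(t))
    # Ties are broken by the original item label (an arbitrary choice).
    ranked = sorted(support, key=lambda item: (support[item], item))
    rank = {item: r for r, item in enumerate(ranked)}
    recoded = [sorted(rank[item] for item in set(t)) for t in transactions]
    return rank, recoded

transactions = [["a", "b"], ["b", "c"], ["b"], ["a", "b"]]
rank, recoded = reorder_by_increasing_frequency(transactions)
print(rank)     # {'c': 0, 'a': 1, 'b': 2}
print(recoded)  # [[1, 2], [0, 2], [2], [1, 2]]
```

After relabelling, the smallest item of a pattern is its least frequent one,
so the subpattern excluded when descending into a subtrie consists of the
more frequent items.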


Efficiency The cost for the computation of the upper bounds is negligible
compared to the cost of the complete algorithm. Indeed, the time T needed
to calculate the upper bounds is largely dictated by the number n of currently
known frequent sets. We have shown experimentally that T scales linearly
with n. Moreover, the constant factor in our implementation is very small
(around 0.00001). We ran several experiments using the different data sets
and varying minimal support thresholds. After every pass of the algorithm,
we registered the number of known frequent sets and the time spent to
compute all upper bounds, resulting in 145 different data points. Figure 1
shows these results.


Figure 1: Time needed to compute upper bounds is linear in the number of
nodes.


Upper bounds

• Figure 2 shows, after each level $k$, the computed upper bound $KK$ and
improved upper bound $KK^*$ for the number of candidate patterns of
size $k + 1$, as well as the actual number $|C_{k+1}|$ it turned out to be.
We omitted the upper bound for $k + 1 = 2$, since the upper bound on
the number of candidate patterns of size 2 is simply $\binom{|L|}{2}$, with
$|L|$ the number of frequent items.

• Figure 3 shows the upper bounds on the total number of candidate
patterns that could still be generated, compared to the actual number
of candidate patterns, $|C_{\text{total}}|$, that were effectively generated.
Again, we omitted the upper bound for $k = 1$, since this number is simply
$2^{|L|} - |L| - 1$, with $|L|$ the number of frequent items.

• Figure 4 shows the computed upper bounds $\mu$ and $\mu^*$ on the maximal
size of a candidate pattern. Also here we omitted the result for $k = 1$,
since this number is exactly the number of frequent items.

The results are pleasantly surprising:

• Note that the improvement of $KK^*$ over $KK$, and of $\mu^*$ over $\mu$,
anticipated by our theoretical discussion, is indeed dramatic.

• Comparing the computed upper bounds with the actual numbers, we
observe the high accuracy of the estimations given by $KK^*$. Indeed,
these estimations match almost exactly the actual number of
candidate patterns that have been generated at level $k + 1$. Also note
that the number of candidate patterns in T40I10D100K is decreasing
in the first four iterations and then increases again. This perfectly
illustrates that the heuristic used for AprioriHybrid, as explained in the
related work section, would not work on this data set. Indeed, any
algorithm that exploits the fact that the current number of candidate
patterns is small enough and that there were fewer candidate patterns in
the current iteration than in the previous iteration would misinterpret
these observations, since the number of candidate patterns in the
next iterations increases again. The presented upper bounds perfectly
predict this increase.

• The upper bounds on the total number of candidate patterns are still
very large when estimated in the first few passes, which is not surprising
because at these initial stages, there is not much information yet. For
the mushroom and the artificial data sets, the upper bound is almost
exact when the frequent patterns of size 3 are known. For the basket
data set, this result is obtained when the frequent patterns of size 4 are
known, and for the BMS-Webview-1 data set when those of size 6 are known.

• We also performed experiments for varying minimal support thresholds.
The results obtained from these experiments were entirely similar to
those presented above.

Figure 2: Actual and estimated number of candidate patterns. Panels:
(a) basket, (b) BMS-Webview-1, (c) T40I10D100K, (d) mushroom.

Figure 3: Actual and estimated total number of future candidate patterns.
Panels: (a) basket, (b) BMS-Webview-1, (c) T40I10D100K, (d) mushroom.

Figure 4: Estimated size of the largest possible candidate pattern.

Combining iterations As discussed in the Introduction, the proposed
upper bound can be used to protect several improvements of the Apriori
algorithm from generating too many candidate patterns. One such
improvement tries to combine as many iterations as possible in the end, when
only few candidate patterns can still be generated. We have implemented this
technique within our implementation of the Apriori algorithm.

We performed several experiments on each data set and limited the number
of candidate patterns that is allowed to be generated. If the upper bound
on the total number of candidate patterns is below this limit, the algorithm
generates and counts all possible candidate patterns within the next
iteration. Figure 5 shows the results. The x-axis shows the total number of
iterations in which the algorithm completed, and the y-axis shows the total
time the algorithm needed to complete. As can be seen, for all data sets, the
algorithm can already combine all remaining iterations into one very early in
the algorithm. For example, the run on the BMS-Webview-1 data set, which
normally takes 15 iterations, could be reduced to six iterations for optimal
performance. If the algorithm already generated all remaining candidate
patterns in the fifth iteration, the number of candidate patterns that turned
out to be infrequent was too large, such that the gain of reducing iterations
was consumed by the time needed to count all these candidate patterns.
Nevertheless, it is still more effective than not combining any passes at all.
When we allowed all candidate patterns to be generated in even earlier
iterations, although the upper bound predicted too large a number of
candidate patterns, this number indeed became too large to keep in main
memory.

Figure 5: Combining iterations. Panels: (a) basket, (b) BMS-Webview-1,
(c) T40I10D100K, (d) mushroom.
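
The decision rule used in this experiment can be sketched as follows. This
Python fragment is an illustration under assumptions, not our C++
implementation: patterns are sorted tuples, the bound on the total number of
candidate patterns is passed in as `total_bound` rather than computed, and
the names `next_level` and `remaining_candidates` are hypothetical.

```python
from itertools import combinations

def next_level(patterns):
    """All sets of size k+1 whose k-subsets are all in `patterns`."""
    patterns = set(patterns)
    if not patterns:
        return set()
    k = len(next(iter(patterns)))
    items = sorted({item for s in patterns for item in s})
    result = set()
    for s in patterns:
        for item in items:
            if item > s[-1]:  # extend each sorted tuple at its end only
                c = s + (item,)
                if all(sub in patterns for sub in combinations(c, k)):
                    result.add(c)
    return result

def remaining_candidates(frequent_k, total_bound, limit):
    """Generate the candidates of all larger sizes at once, or return
    None when the upper bound exceeds the allowed limit."""
    if total_bound > limit:
        return None
    levels, current = [], next_level(frequent_k)
    while current:
        levels.append(current)
        current = next_level(current)
    return levels

L3 = set(combinations(range(1, 7), 3))
levels = remaining_candidates(L3, total_bound=22, limit=100)
print([len(level) for level in levels])                     # [15, 6, 1]
print(remaining_candidates(L3, total_bound=22, limit=10))   # None
```

All generated patterns would then be counted in a single database scan;
the limit guards against the case in which they do not fit in main memory.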


8 Conclusion 

Motivated by several heuristics to reduce the number of database scans in
the context of frequent pattern mining, we provide a hard and tight
combinatorial upper bound on the number of candidate patterns and on the size
of the largest possible candidate pattern, given a set of frequent patterns.
Our findings are not restricted to a single algorithm, but can be applied to
any frequent pattern mining algorithm which is based on the levelwise
generation of candidate patterns. Using the standard Apriori algorithm, on
which most frequent pattern mining algorithms are based, our experiments
showed that these upper bounds can be used to considerably reduce the number
of database scans without taking the risk of getting a combinatorial explosion
of the number of candidate patterns.